Abstract

The paper discusses an Amharic speaker independent continuous
speech recognizer based on an HMM/ANN hybrid approach. The model
was constructed at a context-dependent phone part sub-word level
with the help of the CSLU Toolkit. A promising result of 74.28%
word and 39.70% sentence recognition rate was achieved. These are
the best figures reported so far for speech recognition for the
Amharic language.

1. Introduction 

The general objective of the present research was to examine
and demonstrate the performance of a hybrid HMM/ANN system for
speaker independent continuous Amharic speech recognition.
Amharic is the official language of communication for the
federal government of Ethiopia and is today probably the second
largest language in the country (after Oromo) and quite possibly
one of the five largest on the African continent. It is
estimated to be the mother tongue of more than 17 million
people, with at least an additional 5 million second-language
speakers. Still, just as for many other African languages,
Amharic has received precious little attention from the speech
processing research community; even though recent years have
seen an increasing trend to investigate applying speech
technology to languages other than English, most of the work is
still done on very few, mainly European and East Asian,
languages.

The Ethiopian culture is ancient, and so are the written
languages of the area, with Amharic using its very own script.
This has caused some problems in the digital age: even though
there are several computer fonts for Amharic, and an encoding of
Amharic was incorporated into Unicode in 2000, the language
still has no widely accepted computer representation. In recent
years there has been an increasing awareness that Amharic speech
and language processing resources must be created, as well as
means for digital information access and storage.

The present paper is a step in that direction. It is laid out 
as follows: Section 2 introduces the HMM/ANN hybrid ASR 
paradigm. Section 3 discusses various aspects of Amharic and 
some previous efforts to apply speech technology to the langu- 
age. Then Section 4 describes the actual experiments with con- 
structing, evaluating, and testing an Amharic Automatic Speech 
Recognition System using the CSLU Toolkit [1]. 

2. HMM/ANN hybrids 

Commonly, HMM-based speech recognizers have shown the 
best performance. On the positive side this dominant paradigm 
is based on a rich mathematical framework which allows for 
powerful learning and decoding methods. In particular, HMMs
are excellent at treating temporal aspects by providing good
abstractions for sequences and a flexible topology for statistical
phonology and syntax. However, HMMs have some drawbacks, 
especially for large vocabulary speaker independent continuous 
ASR. The main disadvantage is a relatively poor discrimina- 
tion power. In addition HMMs enforce some practical require- 
ments for distributional assumptions (e.g., uncorrelated features 
within an acoustic vector) and typically make first order Markov 
model assumptions for phone or sub-phone states while ignor- 
ing the correlation between acoustic vectors [2]. 

In effect, HMMs adopt a hierarchical scheme modeling a 
sentence as a sequence of words, and each word as a sequence 
of sub-word units. An HMM can be defined as a stochastic fi- 
nite state automaton, usually with a left-to-right topology when 
used for speech. Each probability is approximated based on
maximum likelihood techniques. Still, these techniques have
been observed to discriminate poorly, since they maximize the
likelihood of each individual node independently of the others.
On the other hand, neural network classifiers have shown good
discrimination power, typically require fewer assumptions, and
can easily be integrated in non-adaptive architectures. This is
the point behind changing the pure HMM approach to the hy- 
brid HMM/ANN model, by using an ANN to augment the ASR 
system [3]. The HMM is used as the main structure of the 
system to cope with the temporal alignment properties of the 
Viterbi algorithm, while the ANN is used in a specific subsys- 
tem of the recognizer to address static classification tasks. This 
has shown performance improvement over pure HMM: Fritsch 
& Finke [4] describe a tree-structural hierarchical HMM/ANN 
system which outperformed HMM on Switchboard. 

In an HMM/ANN model a multi-layer perceptron neural network
is given an input vector of acoustic observation values, $o_t$,
and computes a vector of output values which approximate
a-posteriori state probabilities. Commonly, nine frames are
given as input to the network: four consecutive frames before,
four frames after, and one frame at time $t$, in order to
provide the ANN with more contextual data. The network has one
output for each phone, and the sum of all the output units is
restricted to one. This helps to calculate the a-posteriori
probability of a state $q_j$ conditioned on the acoustic input:
$p(q_j \mid o_t)$. Generally an ASR system has a front end
in which the natural speech wave is digitized and parameterized 
for the recognizer. The recognizer has a neural net to train on 
these digitized and parameterized data. After training, the neu- 
ral net produces the estimation of probabilities of observations 
for the HMM states. The HMM uses these probabilities and 
the language model to compute the probability of a sequence of 
symbols given the observation sequence. Finally, the recognizer 
uses decoders to generate the recognized symbols as output. 
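
The decoding step described above can be illustrated with a minimal
sketch. It assumes a toy left-to-right HMM with three states,
hypothetical per-frame posteriors standing in for the ANN outputs, and
hand-picked transition probabilities; none of these values come from
the experiments, and the language model is left out. Hybrid systems
commonly divide the posteriors by the state priors to obtain scaled
likelihoods before decoding; that step is also omitted here.

```python
import math


def viterbi(posteriors, trans, init):
    """Return the most probable state sequence for a sequence of frames.

    posteriors[t][j] plays the role of the network output p(q_j | o_t),
    trans[i][j] is the transition probability from state i to state j,
    init[j] is the probability of starting in state j.
    """
    def log(p):
        return math.log(p) if p > 0.0 else float("-inf")

    n_states = len(init)
    # Log scores of the best path ending in each state at the first frame.
    score = [log(init[j]) + log(posteriors[0][j]) for j in range(n_states)]
    backptr = []
    for frame in posteriors[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            # Pick the best predecessor state for state j.
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log(trans[i][j]))
            new_score.append(score[best_i] + log(trans[best_i][j]) + log(frame[j]))
            pointers.append(best_i)
        score = new_score
        backptr.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy left-to-right model: each state may loop or move one step forward.
TRANS = [[0.6, 0.4, 0.0],
         [0.0, 0.6, 0.4],
         [0.0, 0.0, 1.0]]
INIT = [1.0, 0.0, 0.0]
# Hypothetical posteriors for five frames; each row sums to one.
POSTERIORS = [[0.8, 0.1, 0.1],
              [0.7, 0.2, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.6, 0.3],
              [0.1, 0.2, 0.7]]

best_path = viterbi(POSTERIORS, TRANS, INIT)
print(best_path)
```

Working in the log domain avoids numerical underflow on long
utterances, which is why the sketch adds log scores instead of
multiplying probabilities.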

3. Amharic Speech Processing

With about 70 million inhabitants, Ethiopia is the third most
populous African country and harbours some 80 different
languages. Three of these are dominant: Oromo, a Cushitic
language spoken in the South and Central parts of the country
and written using the Latin alphabet; Tigrinya, spoken in the
North and in neighbouring Eritrea; and Amharic, spoken in most
parts of the country, but predominantly in the Eastern, Western,
and Central regions. Amharic and Tigrinya are Semitic languages
and thus distantly related to Arabic and Hebrew.

3.1. The Amharic language 

Following the Constitution of 1994, Ethiopia is divided into
nine fairly independent regions, each with its own nationality
language. However, Amharic is the language for country-wide
communication and was also for a long period the principal
language for literature and the medium of instruction in primary
and secondary schools of the country (while higher education is
carried out in English). Amharic speakers are mainly Orthodox
Christians, with Amharic and Tigrinya sharing common roots in
the ecclesiastical Ge'ez still used by the Coptic church; both
languages are written horizontally and left-to-right using the
Ge'ez script. Written Ge'ez can be traced back to at least the
4th century A.D. The first versions of the written language
included consonants only, while the characters in later versions
represent consonant-vowel (CV) phoneme pairs.

Amharic words use consonantal roots with vowel variation
expressing difference in interpretation. In modern written
Amharic, each syllable pattern comes in seven different forms
(called orders), reflecting the seven vowel sounds. The first
order is the basic form; the other orders are derived from it by
more or less regular modifications indicating the different
vowels. There are 33 basic forms, giving 7 * 33 = 231 syllable
patterns (syllographs), or fidEls. Two of the base forms
represent vowels in isolation (ዐ and አ), but the rest are for
consonants (or semi-vowels classed as consonants) and thus
correspond to CV pairs, with the first order being the base
symbol with no explicit vowel indicator (though a vowel is
pronounced after the consonant). The writing system also
includes four (incomplete, five-character) orders of labialised
velars and 24 additional labialised consonants. In total, there
are 275 fidEls. See, e.g., [5] for an introduction to the
Ethiopian writing system.
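
The symbol counts given above can be tallied in a few lines. The
sketch below only restates the numbers from the text (the constant
names are illustrative); the resulting total agrees with the 275
symbols of the script mentioned in the discussion of redundant
characters.

```python
# Symbol counts as stated in the text; the names are illustrative only.
BASE_FORMS = 33                 # basic (first order) forms
ORDERS = 7                      # one order per vowel sound
LABIALISED_VELAR_ORDERS = 4     # incomplete orders of labialised velars
CHARS_PER_VELAR_ORDER = 5       # five characters in each such order
EXTRA_LABIALISED = 24           # additional labialised consonants

regular_fidels = BASE_FORMS * ORDERS
labialised_velars = LABIALISED_VELAR_ORDERS * CHARS_PER_VELAR_ORDER
total_fidels = regular_fidels + labialised_velars + EXTRA_LABIALISED

print(regular_fidels, labialised_velars, total_fidels)  # 231 20 275
```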

The Amharic writing system uses many different ways to denote
compound words and there is no agreed upon spelling standard for
compounds. As a result of this, and of the size of the country
leading to vast dialectal dispersion, lexical variation and
homophony are very common. In addition, not all the letters of
the Amharic script are strictly necessary for the pronunciation
patterns of the spoken language; some were simply inherited from
Ge'ez without having any semantic or phonetic distinction in
modern Amharic. There are many cases where numerous symbols are
used to denote a single phoneme, as well as words that have
extremely different orthographic form and slightly distinct
phonetics, but the same meaning. For example, most labialised
consonants are basically redundant, and there are actually only
39 context-independent phonemes (monophones): of the 275 symbols
of the script, only about 233 remain if the redundant ones are
removed.

In contrast to the character redundancy, there is no mechanism
in the Amharic writing system to mark gemination of consonants.
The words /wana/ (swimming) and /wanna/ (main, core) are both
written as ዋና, but have two completely different meanings
depending on whether the consonant /n/ is geminated. This
requires different reference models in the database for the
multiple forms of the sound depending on the gemination.
(Another problem is an ambiguity with the 6th order characters:
whether they are vowelled or not. However, this is not relevant
to this work.)

3.2. Previous work 

This study aims at investigating and testing the possibility
of developing speaker independent continuous Amharic speech
recognition systems using a hybrid of HMM and ANN systems.
Speech and language technology for the languages of Ethiopia is
still very much uncharted territory; however, on the language
processing side some initial work has been carried out, mainly
on Amharic word formation and information access. See [6] or [7]
for short overviews of the efforts that have been made so far to
develop language processing tools for Amharic.

Research conducted on speech technology for Ethiopian
languages has been even more limited. Laine [8] made a valuable
effort to develop an Amharic text-to-speech synthesis system,
and Tesfay [9] did similar work for Tigrinya.¹ Solomon [10]
built speaker dependent and speaker independent HMM-based
isolated consonant-vowel syllable recognition systems for
Amharic. He proposed that CV-syllables would be the best
candidates for the basic recognition units for Amharic.

Solomon's work was extended by Kinfe [11] who used the HTK
Toolkit to build HMM word recognizers at three different
sub-word levels: phoneme, tied-state triphone, and CV-syllable.
Kinfe collected a 170 word vocabulary from 20 speakers. He
considered a subset of the Amharic syllables, concentrating on
the combination of 20 phonemes with the seven vowels, or in
total 140 CV-units. Kinfe's training and test sets both
consisted of 50 discrete words. Contrary to Solomon's
predictions, the performance of the syllable-level recognition
was very poor (for unclear reasons) and Kinfe abandoned it in
favour of the phoneme- and triphone-based recognizers. For the
latter two he reports isolated word recognition accuracies of
83.1% and 78.0%, respectively, on speaker dependent models,
while the speaker independent models gave 75.5% for
phoneme-based models and 77.9% for tied-state triphone models.

Molalgne [12] tried to compare HMM-based small vocabulary
speaker-specific continuous speech recognizers built using three
different toolkits: CSLU, HTK, and the MSSTATE Toolkit from
Mississippi State, but failed to set up CSLU, so only two
toolkits were actually tested. He collected a corpus of 50
sentences with ten words (the digits) from a single speaker.
While HTK was clearly faster than MSSTATE, the speaker dependent
recognition performance of the two systems was comparable: HTK
reached 82.5% word accuracy and 72.5% sentence accuracy, while
MSSTATE reached 79.0% and 67.5%, respectively.

Martha [13] worked on a small vocabulary isolated word
recognizer for a command and control interface to Microsoft
Word, while Zegaye [14] continued the work on speaker
independent continuous Amharic ASR. He used a pure HMM-based
approach and reached 76.2% word accuracy and 26.1% sentence
level accuracy. However, there is still a lot of work to be done
towards achieving a full-fledged automatic Amharic speech
recognition system. The intention of the present research was to
use an HMM/ANN hybrid model approach as an alternative for
better performance. For this we utilized an implementation of
such a model in the CSLU Toolkit.



¹ In the text we follow the practice of referring to Ethiopians by
their given names. However, the reference list follows European
standard and also gives surnames (i.e., the father's given name for
an Ethiopian).

4. An Amharic SR system

This research attempts to design a prototype speech
recognizer for the Amharic language. The recognizer uses
phonemes as base units and is designed to be speaker independent
and to recognize continuous speech. In contrast to the pure
HMM-based work done by Zegaye [14], the system implements the
HMM/ANN hybrid model approach. The development process was
performed using the CSLU Toolkit installed on the Microsoft
Windows 2000 platform. Various preprocessing programs and script
editors were used to handle vocabulary files.

4.1. The CSLU Toolkit 

The CSLU Toolkit [1] was designed not only for speech
recognition, but also for research and educational purposes in
the area of speech and human-computer interaction. It is
developed and maintained by the Center for Spoken Language
Understanding, a research centre at the Oregon Graduate
Institute of Science and Technology, Portland, and the Center
for Spoken Language Research at the University of Colorado. The
toolkit, which is available free of charge for educational,
research, personal, and evaluation purposes under a license
agreement, supports core technologies for speech recognition and
speech synthesis, plus a graphical rapid application development
environment for developing spoken dialogue systems.

The toolkit supports the development of HMM or HMM/ANN
hybrid-based speech recognition systems. For this purpose it has
many modules or tools interacting with each other in an
environment called CSLU-HMM. The toolkit needs a consistent
organization and naming of directories and files, which has to
be strictly followed. This is tedious work, but also clearly
doable (still, this might have been the reason why Molalgne
decided that it was not possible to use the CSLU Toolkit [12]).

4.2. Speech data 

Apart from the specifics of the language itself, the main
problem with doing speech recognition for an under-resourced
language like Amharic is the lack of previously available data:
no standard speech corpus has been developed for Amharic.
However, we were able to use a corpus of 50 speakers recorded at
a 16 kHz sampling rate by Solomon [10]. 100 different sentences
of read speech were recorded for each speaker.

The corpus was prepared and processed using Speechview, a
part of the CSLU Toolkit providing a graphical interface to
prepare speech data. The tool is used to record, display, save,
and edit speech signals in their wave format. It also provides
spectrograms and some other speech wave related data like pitch
and energy contours, neural net outputs, and phonetic labels.
With the help of the Speechview tool, one can collect and
prepare speech data in an easy way for training a recognizer.
The process of annotating the speech waveform, which is the most
tedious and difficult process in the development of speech
recognition systems, can be done at different transcription
levels.

Ten spoken sentences each from ten female speakers were 
annotated at the phoneme level for the training corpus and time- 
aligned word level transcriptions were generated automatically. 
Two more speakers were annotated for evaluation purposes. 
Long silences at the beginning and end of the wave file were 
trimmed off and the boundaries of word-level transcriptions 
were adjusted accordingly. 

A vocabulary file was created based on the pronunciation of
each word in the data set and parts of the phones. This gave a
vocabulary of 778 words represented by 34 phones that in turn
were split into 57 phone parts: four phones were defined to
consist of three parts each, 15 phones have two parts, and 15
have one part only. (The Ethiopic symbols listing the phones in
each group are not legible in this copy; within each group they
were ordered according to frequency.)
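
As a quick check, the phone and phone-part totals follow directly from
the group sizes given above. The sketch below also shows how a
pronunciation could be expanded into phone-part labels; the ASCII phone
names, their part counts, and the `phone_k` label format are
hypothetical and are not the toolkit's own notation.

```python
# Number of phones for each part count, as stated in the text.
PHONES_WITH_N_PARTS = {3: 4, 2: 15, 1: 15}

total_phones = sum(PHONES_WITH_N_PARTS.values())
total_parts = sum(n * count for n, count in PHONES_WITH_N_PARTS.items())


def expand(pronunciation, parts_per_phone):
    """Expand a phone sequence into phone-part labels such as 'a_1', 'a_2'."""
    labels = []
    for phone in pronunciation:
        for k in range(1, parts_per_phone[phone] + 1):
            labels.append(f"{phone}_{k}")
    return labels


# Hypothetical phone names and part counts, for illustration only.
EXAMPLE_PARTS = {"t": 3, "a": 2, "n": 1}
example = expand(["t", "a", "n", "a"], EXAMPLE_PARTS)

print(total_phones, total_parts)  # 34 57
print(example)
```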

4.3. Experiments 

Thereafter a recognizer was created, the frame vectors were
generated automatically in the toolkit, and the recognizer was
trained on the phone part files. The ANN of the recognizer
contained an output layer with the phone parts, while the input
layer was a 180 node grid representing 20 features each from
nine time frames (t ± 4 × 10 ms).
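
The 180 input nodes follow from stacking nine frames of 20 features
each. A minimal sketch of such a context window is given below. The
toolkit generates the frame vectors itself, so this is only an
illustration: the function name is hypothetical, and repeating the edge
frame at utterance boundaries is an assumption, not something the
experiments specify.

```python
FEATURES_PER_FRAME = 20
CONTEXT = 4  # frames on each side of the centre frame at time t


def stack_context(frames, t, context=CONTEXT):
    """Concatenate frames t-context .. t+context into one input vector.

    Frames beyond the utterance edges are replaced by the nearest edge
    frame (an assumption made for this sketch).
    """
    last = len(frames) - 1
    window = []
    for offset in range(-context, context + 1):
        index = min(max(t + offset, 0), last)
        window.extend(frames[index])
    return window


# Hypothetical utterance of 12 frames; frame i holds the value i in
# all 20 feature slots so that the stacking order is easy to see.
frames = [[float(i)] * FEATURES_PER_FRAME for i in range(12)]
input_vector = stack_context(frames, t=5)

print(len(input_vector))  # 180
```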

The recognizer was evaluated on two sentences each from 
ten speakers who were all found in the training data (in total 20 
sentences and 236 words). The results were as shown in Table 1. 

Table 1: Recognition accuracy on known speakers.

Best result: 78.56% word and 44.07% sentence level accuracy. 



Table 2: Recognition accuracy on unknown speakers. 

Best result: 74.28% word and 39.70% sentence level accuracy. 

For each iteration the columns in Table 1 give the percentage of
substitutions, insertions, and deletions, as well as the word
accuracy and the percentage of correct sentences. The best
results (78.56% word level accuracy and 44.07% sentence
correctness) were obtained after 28 iterations.
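
For readers unfamiliar with these measures, the sketch below shows one
common way to obtain substitution, deletion, and insertion counts from
a minimum-edit alignment of reference and recognized word strings, and
to derive word accuracy from them. It assumes the usual definition
(N - S - D - I) / N over N reference words; the exact scoring procedure
of the toolkit is not described here, and the word tokens are
placeholders.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    # table[i][j] = (cost, subs, dels, ins) for ref[:i] against hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)      # only deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)      # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (cost, s, d, n)          # match
            else:
                diagonal = (cost + 1, s + 1, d, n)  # substitution
            cost, s, d, n = table[i - 1][j]
            deletion = (cost + 1, s, d + 1, n)
            cost, s, d, n = table[i][j - 1]
            insertion = (cost + 1, s, d, n + 1)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    return table[len(ref)][len(hyp)][1:]


def word_accuracy(ref, hyp):
    """Word accuracy in percent, assuming (N - S - D - I) / N."""
    subs, dels, ins = align_counts(ref, hyp)
    return 100.0 * (len(ref) - subs - dels - ins) / len(ref)


# Placeholder tokens: one substitution (w2 -> wX) and one insertion (w5).
reference = "w1 w2 w3 w4".split()
recognized = "w1 wX w3 w4 w5".split()
print(align_counts(reference, recognized), word_accuracy(reference, recognized))
```

Under this definition a sentence counts as correct only when the
recognized word string matches the reference exactly, which is why
sentence level figures are much lower than word level ones.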

When the same recognizer was tested on another ten speakers
who were not included in the training data, with two sentences
each (218 words in total), the recognition rate degraded. As can
be seen in Table 2, the best results were again obtained after
the 28th iteration. The word accuracy was reduced by 4.28
percentage points and the sentence level recognition rate by
4.37 percentage points. The error rates used in the comparison
with previous work, 21.44% at the word level and 55.93% at the
sentence level, correspond to the known-speaker accuracies of
Table 1 (100 - 78.56 and 100 - 44.07).

Accordingly, the HMM/ANN hybrid recognizer gave a decrease of
2.36 percentage points in word error rate and 18.01 percentage
points in sentence error rate compared to Zegaye's purely
HMM-based recognizer [14], which had 23.80% word and 73.94%
sentence error rates. The relative error reduction compared to
Zegaye's work is thus 9.92% at the word level and 24.36% at the
sentence level.
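
The figures in this comparison can be re-derived from the reported
accuracies. The sketch below is only a check of the arithmetic; the
function and variable names are illustrative. It uses the known-speaker
accuracies, since 21.44% and 55.93% equal 100 minus the 78.56% and
44.07% reported for Table 1.

```python
def error_rate(accuracy):
    """Error rate in percent for an accuracy given in percent."""
    return 100.0 - accuracy


def relative_reduction(baseline_error, new_error):
    """Relative error reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Accuracies reported in the text (percent).
KNOWN = {"word": 78.56, "sentence": 44.07}
UNSEEN = {"word": 74.28, "sentence": 39.70}
# Error rates quoted for the pure HMM baseline [14].
BASELINE_ERROR = {"word": 23.80, "sentence": 73.94}

for level in ("word", "sentence"):
    err = error_rate(KNOWN[level])
    drop = KNOWN[level] - UNSEEN[level]
    rel = relative_reduction(BASELINE_ERROR[level], err)
    print(f"{level}: error {err:.2f}%, drop on unseen speakers {drop:.2f}, "
          f"relative reduction {rel:.2f}%")
```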

5. Conclusions 

The paper reported experiences with using the CSLU Toolkit 
to build a hybrid HMM/ANN speaker independent continuous 
speech recognizer for Amharic, the main language of Ethiopia. 
An annotated corpus was created from previously recorded 
speech data. Ten sentences each from twelve speakers were 
marked up at the phoneme level and a vocabulary of 778 words 
was created. 

For speakers found in the training data, the best results
obtained were 78.6% word and 44.1% sentence level accuracy. When
tested on data from ten previously unseen speakers, the
recognizer had a 74.3% word accuracy and 39.7% sentence
accuracy. The relative sentence level error reduction compared
to previous work on Amharic using pure HMM-based methods was
24.4%.

The CSLU Toolkit proved to be a good vehicle for developing
hybrid HMM/ANN-based recognizers, and the experiments indicate
that a better recognizer can be developed with further
optimization efforts. However, the implementation of the toolkit
in Windows needs some revisions. There were problems fully
downloading the Toolkit Installer, and after installation the
system integration with Windows required considerable effort.